Stable Diffusion
Diffusion models (introduced in the Denoising Diffusion Probabilistic Models paper, Ho et al., 2020) generate an image in the following way:
- Forward process: A Markov chain that adds Gaussian noise to the image for $T$ steps (usually $T = 1000$), with per-step variances $\beta_t$ given by a noise schedule.
- Reverse process: A U-Net or CNN $\epsilon_\theta$ that takes the noisy image $x_t$ and tries to predict the noise added to it. The previous step is then recovered as $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$, where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $z \sim \mathcal{N}(0, I)$ (see the sketch after this list).
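A minimal PyTorch sketch of both processes, assuming the standard DDPM linear schedule and a hypothetical noise-prediction network `model(x_t, t)`:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # DDPM linear noise schedule
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)        # cumulative signal fraction

def forward_noise(x0, t, eps):
    """q(x_t | x_0): jump to any step t in closed form."""
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

def reverse_step(model, x_t, t):
    """One step of p(x_{t-1} | x_t): remove the predicted noise,
    then re-inject a smaller amount of fresh noise (none at t = 0)."""
    eps_pred = model(x_t, t)                     # hypothetical noise predictor
    mean = (x_t - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_pred) / alphas[t].sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(x_t)  # sigma_t^2 = beta_t
```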
Predicting the noise instead of the de-noised image allows for a simple loss function: $L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$. A diffusion model is sometimes called a latent variable model because the noisy intermediates $x_1, \dots, x_T$ act as latents: we take an image, convert it to a latent version by adding noise to it, and try to recover it back.
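In code, the simple loss is just MSE between the sampled and the predicted noise (reusing `forward_noise` from the sketch above; `model` is again a hypothetical noise predictor):

```python
import torch
import torch.nn.functional as F

def training_loss(model, x0):
    """L_simple from the DDPM paper: MSE between true and predicted noise."""
    t = torch.randint(0, T, (x0.shape[0],))      # random timestep per sample
    eps = torch.randn_like(x0)                   # the noise we try to recover
    x_t = forward_noise(x0, t, eps)              # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)
```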
This is more stable than the GAN approach, since we don't train two competing models. Instead, a single model generates images by iteratively denoising pure noise.
The linear schedule used to noise the image doesn't kill the signal completely: a tiny amount of the original image survives even at the final step, which makes the model struggle to generate pure black or pure white images. Modern implementations use "Zero Terminal SNR" rescaling (Common Diffusion Noise Schedules and Sample Steps Are Flawed, Lin et al., 2024), which ensures the signal-to-noise ratio is mathematically zero at the final step.
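As an illustration, here is a sketch of the problem and of the rescaling fix as described by Lin et al.: shift and scale $\sqrt{\bar{\alpha}_t}$ so the terminal value is exactly zero.

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)         # linear schedule
alphas_bar = torch.cumprod(1 - betas, dim=0)
snr_T = alphas_bar[-1] / (1 - alphas_bar[-1])
print(snr_T)                                     # ~4e-5: small, but not zero

def rescale_zero_terminal_snr(betas):
    """Shift/scale sqrt(alpha_bar) so the final SNR is exactly zero."""
    ab_sqrt = torch.cumprod(1 - betas, dim=0).sqrt()
    first, last = ab_sqrt[0].clone(), ab_sqrt[-1].clone()
    ab_sqrt = (ab_sqrt - last) * first / (first - last)
    alphas_bar = ab_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1 - alphas
```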
Prompting
The text encoding required for prompting comes from CLIP (Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021).
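For example, with the Hugging Face transformers library and the CLIP text encoder used by Stable Diffusion v1 (a sketch; checkpoint name and shapes refer to SD v1):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["an astronaut riding a horse"],
                   padding="max_length",
                   max_length=tokenizer.model_max_length,  # 77 tokens
                   truncation=True, return_tensors="pt")
# Per-token embeddings of shape (batch, 77, 768); the whole sequence,
# not just a pooled vector, is what conditions the image model.
text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
```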
The conditioning mechanism that ties the image to the text embeddings comes from GLIDE (GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, Nichol et al., 2021).
Classifier-Free Guidance (Ho & Salimans, 2022) amplifies the effect of this conditioning: the model is trained to work both with and without the prompt, and at sampling time the two predictions are combined.
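A sketch of the sampling-time combination, assuming a hypothetical `unet` that accepts text embeddings as its conditioning input:

```python
def cfg_noise(unet, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate past the conditional prediction,
    away from the unconditional one, by a factor of guidance_scale."""
    eps_uncond = unet(x_t, t, uncond_emb)        # empty-prompt embeddings
    eps_cond = unet(x_t, t, cond_emb)            # actual prompt embeddings
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```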
This conditioning uses cross-attention under the hood: the image features act as queries, while the text-token embeddings supply the keys and values.
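A minimal cross-attention layer in this style, with assumed dimensions for illustration:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries from image features; keys/values from text tokens."""
    def __init__(self, img_dim, txt_dim, inner_dim=320):
        super().__init__()
        self.to_q = nn.Linear(img_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(txt_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(txt_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, img_dim)
        self.scale = inner_dim ** -0.5

    def forward(self, img_tokens, txt_tokens):
        q = self.to_q(img_tokens)                # (B, H*W, inner)
        k = self.to_k(txt_tokens)                # (B, 77, inner)
        v = self.to_v(txt_tokens)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return self.to_out(attn @ v)             # text-conditioned features
```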